Intra-Lingual and Cross-Lingual Prosody Modelling

Author

  • Gopala Krishna Anumanchipalli
Abstract

Statistical Parametric Speech Synthesis (SPSS) offers flexibility and computational advantages over other methods for text-to-speech synthesis. While the speech output is intelligible, statistically trained voices sound less natural because of the amount of signal processing and statistical averaging that goes into building the models. Much of the blame for the lack of naturalness falls on the inappropriate and monotonous prosody of synthesized speech. The voice source, which directly affects prosody, carries information complementary to that of the vocal tract and has its own patterns that need to be dealt with appropriately. Under this hypothesis, this thesis investigates representations and optimal strategies for prosody modelling within the SPSS paradigm. We propose the Statistical Phrase/Accent Model (SPAM) of intonation as a framework that (i) is a computational model with associated training and synthesis methods for prosody and (ii) has a strong theoretical basis for prosodic description. The SPAM framework combines the strengths of existing complementary views of intonation, such as Autosegmental Metrical Phonology, production paradigms such as the Fujisaki model, and purely computational approaches such as the TILT model. We demonstrate that Accent Groups, a new data-derived phonological unit, are the optimal representational level for modelling pitch accents, and we integrate them within a multi-tier phonological model to synthesize natural and expressive intonation contours. In addition to improving text-to-speech synthesis, the framework is shown to improve voice conversion, both intra-lingually across speakers and cross-lingually across languages. We apply the proposed techniques to the synthesis of audiobooks by incorporating richer semantic and contextual features beyond the sentence. We also look at the closely related problem of voice conversion within the SPAM framework to capture the speaking style of a target speaker more effectively. The techniques are also applied to cross-lingual voice conversion in the context of speech-to-speech machine translation, which aims to automatically dub a video into a target language while preserving the speaker’s intent in the original language after translation. Appropriate objective and subjective evaluations are conducted to show the performance of the proposed techniques.

Similar resources

Analysis of Multi-Lingual Emotion Recognition Using Auditory Attention Features

In this paper, we build mono-lingual and cross-lingual emotion recognition systems and report performance on English and German databases. The emotion recognition system uses biologically inspired auditory attention features together with a neural network for learning the mapping between features and emotion classes. We first build mono-lingual systems for both Berlin Database of Emotional Spee...

Non-native speech synthesis preserving speaker individuality based on partial correction of prosodic and phonetic characteristics

This paper presents a novel non-native speech synthesis technique that preserves the individuality of a non-native speaker. Cross-lingual speech synthesis based on voice conversion or HMM-based speech synthesis, which synthesizes foreign language speech of a specific non-native speaker reflecting the speaker-dependent acoustic characteristics extracted from the speaker’s natural speech in his/h...

Polyglot speech prosody control

Within a polyglot text-to-speech synthesis system, the generation of an adequate prosody for mixed-lingual texts, sentences, or even words, requires a polyglot prosody model that is able to seamlessly switch between languages and that applies the same voice for all languages. This paper presents the first polyglot prosody model that fulfills these requirements and that is constructed from indep...

Cross-Lingual Voice Conversion

Cross-lingual voice conversion refers to the automatic transformation of a source speaker’s voice to a target speaker’s voice in a language that the target speaker cannot speak. It involves a set of statistical analysis, pattern recognition, machine learning, and signal processing techniques. This study focuses on the problems related to cross-lingual voice conve...

Non-Native Text-to-Speech Preserving Speaker Individuality Based on Partial Correction of Prosodic and Phonetic Characteristics

This paper presents a novel non-native speech synthesis technique that preserves the individuality of a non-native speaker. Cross-lingual speech synthesis based on voice conversion or Hidden Markov Model (HMM)-based speech synthesis is a technique to synthesize foreign language speech using a target speaker’s natural speech uttered in his/her mother tongue. Although the technique holds promise t...


Publication date: 2013